Summarization Algorithms for Record Linkage
Authors
Abstract
Record linkage has received significant attention in recent years because of the large number of data sources that must be integrated to support data analysis. In many cases, this integration involves disparate data sources containing huge volumes of records and must be performed in near real-time to support critical applications. In this paper, we propose the first summarization algorithms for speeding up online record linkage tasks. Our first method, SkipBloom, efficiently summarizes the participating data sets using their blocking keys, allowing very fast comparisons among them. The second method, BlockSketch, summarizes a block so that a submitted query record requires only a constant number of comparisons during the matching phase. We also extend BlockSketch to streaming data, where the objective is to use a constant amount of main memory to handle potentially unbounded data sets. An extensive experimental evaluation on three real-world data sets demonstrates the superiority of our methods over two state-of-the-art algorithms for online record linkage.
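To make the idea of summarizing a data set by its blocking keys concrete, the sketch below uses a plain Bloom filter. It is not the paper's SkipBloom, whose internals the abstract does not describe. The `BloomFilter` class, the `blocking_key` function, and the record fields `surname` and `birth_year` are all assumptions made for illustration only.

```python
import hashlib
import math


class BloomFilter:
    """Plain Bloom filter over strings (illustrative; not the paper's SkipBloom)."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(
            8,
            math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2),
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive all k positions from two halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def blocking_key(record: dict) -> str:
    # Hypothetical blocking key: surname prefix plus birth year.
    return f"{record['surname'][:3].lower()}|{record['birth_year']}"


# Summarize data set A once, then probe it with a query record from data set B.
dataset_a = [
    {"surname": "Smith", "birth_year": 1980},
    {"surname": "Jones", "birth_year": 1975},
]
summary = BloomFilter(expected_items=1000)
for rec in dataset_a:
    summary.add(blocking_key(rec))

query = {"surname": "Smithe", "birth_year": 1980}
candidate_block_exists = blocking_key(query) in summary
```

A filter of this kind never reports a stored key as absent, so a negative answer lets a query skip the matching phase entirely. A positive answer may be a false positive and still has to be confirmed against the actual block.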
Similar resources
Probabilistic Linkage of Persian Record with Missing Data
Extended Abstract. When comprehensive information about a topic is scattered across two or more data sets, using only one of them loses the information available in the others. Hence, it is necessary to integrate the scattered information into a single comprehensive data set. On the other hand, sometimes we are interested in recognizing duplicates within a data set. The i...
Efficient Record Linkage Algorithms Using Complete Linkage Clustering.
Data from different agencies share data of the same individuals. Linking these datasets to identify all the records belonging to the same individuals is a crucial and challenging problem, especially given the large volumes of data. A large number of available algorithms for record linkage are prone to either time inefficiency or low-accuracy in finding matches and non-matches among the records....
An Empirical Comparison of Approaches to Approximate String Matching in Private Record Linkage
Due to the frequency of spelling and typographical errors in practical applications, record linkage algorithms have to use string similarity functions. In many legal contexts, identifiers such as names have to be encrypted before a record linkage can be attempted. Therefore, algorithms for computing string similarity functions with encrypted identifiers are essential for approximating string ma...
Creating a large database test bed with typographical errors for record linkage evaluation.
Evaluation of record linkage algorithms requires a large database test bed that is representative of the real-world data. We created such a large database that reflects the demographic distribution of a typical population and contains typographical errors commonly made during data entry. This database can be used with high confidence as a test bed to evaluate various record linkage algorithms.
Evaluating Genetic Algorithms for selection of similarity functions for record linkage
Machine learning algorithms have been successfully employed in solving the record linkage problem. Machine learning casts the record linkage problem as a classification problem by training a classifier that classifies 2 records as duplicates or unique. Irrespective of the machine learning algorithm used, the initial step in training a classifier involves selecting a set of similarity functions ...
Journal title:
Volume, issue:
Pages: -
Publication date: 2018